Applicants: FRIEDER et al. 
Appl. No. 09/629,175 

Remarks 

Applicants thank the Examiner for his careful consideration of their Application. 
Reconsideration of this Application is now respectfully requested. 

Upon entry of the foregoing Amendment, Claims 1-43 remain pending in the application, 
with Claims 1, 29, 30, 33, 34, 37, 38, and 42 being the independent claims. 

Based on the above Amendment and the following Remarks, Applicants respectfully 
request that the Examiner reconsider all outstanding objections and rejections and that they be 
withdrawn. 

Objections to the Specification 

At Pages 2 and 3, the Office Action objects to the specification as containing "embedded 
hyperlinks" at Page 2, lines 5 and 14. Applicants have reviewed the specification and have 
amended it to remove the hyperlink found at Page 2, line 14; however, Applicants have been 
unable to locate the other embedded hyperlink referred to. Applicants request that the Examiner 
provide more specific instructions as to what he would like them to amend or that the Examiner 
make such amendment by Examiner's Amendment after the application has been allowed. 

At Pages 2-3, the Office Action also cites the improper incorporation by reference of 
essential subject matter. Applicants do not understand this and request further clarification, as 
they are unaware of any such incorporation by reference of essential subject matter.

Rejections under 35 U.S.C. § 112

At Page 3, the Office Action rejects Claim 43 under 35 U.S.C. § 112, first paragraph, as
not being enabled, based on a lack of support for the term, "semantic filtering." Applicants are
confused by this rejection, given the explicit support for Claim 43 described in their previous
Amendment (of December 9, 2002). However, on the chance that the Office Action was merely 
objecting to the fact that the term does not appear, per se, in the disclosure, Applicants have 
amended the disclosure to explicitly use the term, "semantic filtering." 

At Page 4, the Office Action rejects Claim 43 under 35 U.S.C. § 112, second paragraph,
as being indefinite, citing a lack of clarity as to how the "semantic filtering" is performed. 
Applicants are also confused by this rejection, again, given the specific support for this concept 
described in their previous Amendment, and request further clarification.

Rejections under 35 U.S.C. § 102 and under 35 U.S.C. § 103

The Office Action, at Pages 4-9, rejects Claims 1-6, 8, 10-14, 19-21, and 23-43 under 35 
U.S.C. § 102(e) as being anticipated by Aiken (U.S. Patent No. 6,240,409). At Pages 9-10, the 
Office Action rejects Claims 7, 9, 15-18, and 22 under 35 U.S.C. § 103(a) as being unpatentable 
over Aiken. Applicants respectfully traverse these rejections for the following reasons. 

The invention, for example, as claimed in Claims 1 and 29, is directed toward the 
detection of similar documents. A document is obtained and filtered, and a document identifier 
and hash value are determined for the document. The document identifier and hash value are 
used to generate a tuple corresponding to the filtered document. Note that independent Claims 1, 
29, 30, 33, 34, 37, 38, and 42 have now been amended to emphasize that only a single hash 
value is generated for each document (and a single tuple for each document, where the claim 
recites the generation of a tuple). The tuple is compared to other tuples in a document storage
structure. Similarity to another document is determined if the tuple is clustered with another
tuple in the document storage structure. 

As discussed above, Claims 1, 29, 30, 33, 34, 37, 38, and 42 include the determination of 
"a single hash value" for the document (e.g., in Claims 1 and 29, "the filtered document"). In 
contrast, Aiken computes hash values for sub-strings, as discussed, for example, at Col. 5, lines 
10-35. That is, while the claimed invention determines hash values on a document basis (i.e., the 
generation of a single hash value for a document), Aiken computes hash values on a sub-string 
basis (i.e., multiple hash values for a given document). Therefore, Aiken does not disclose 
determination of "a single hash value" for a document. 

To elaborate further, Aiken recites, at Col. 4, lines 38-53, that a document, or "string," is 
translated into "a token string that represents and preserves the structure and content of the 
original or raw data string," where the translation is according to the type of string. For example, 
as described at Col. 4, lines 54 ff., an English language document may be translated by removing
punctuation, spacing, and capitalization. The tokens include position data, describing the 
location of each token within the document. As further described at Col. 5, lines 10-32, sub- 
strings of a given length are formed from the translated string, and, as described at Col. 5, lines 
33-43, a hash function is applied to each sub-string. If the sub-string length were to be chosen as 
the length of the entire document, essential information (i.e., positions of tokens within the 
document) would be lost. In this case, the method of matching documents, as shown in Figs. 4a 
and 4b and described at Cols. 10 ff., would not function properly, as it relies on the use of such
position information. Therefore, the sub-string length cannot be chosen to be equal to the
entire document length; there must be multiple sub-strings for each document, and thus 
multiple hash values per document, if Aiken's method is to function properly. 
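
By way of illustration only, the distinction between the two approaches may be sketched as follows. The sketch is not taken from the application or from Aiken; the function names, the use of SHA-1, and the particular filtering step (lowercasing, stripping punctuation, and sorting unique tokens) are assumptions chosen solely to make the example runnable.

```python
import hashlib


def filter_tokens(text):
    # Hypothetical filter: lowercase, strip punctuation, keep unique tokens, sort.
    tokens = {"".join(ch for ch in word.lower() if ch.isalnum())
              for word in text.split()}
    tokens.discard("")
    return sorted(tokens)


def document_hash(text):
    # Document-basis approach: exactly one hash value per filtered document.
    h = hashlib.sha1()  # SHA-1 is an assumption; any hash function would do here
    for token in filter_tokens(text):
        h.update(token.encode("utf-8"))
        h.update(b"\x00")  # separator so token boundaries affect the hash
    return h.hexdigest()


def document_tuple(doc_id, text):
    # A single <document identifier, hash value> tuple per document.
    return (doc_id, document_hash(text))


def substring_hashes(text, length):
    # Sub-string-basis approach: one <hash value, position> pair per sub-string,
    # hence many pairs for a single document.
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return [(hashlib.sha1(s[i:i + length].encode("utf-8")).hexdigest(), i)
            for i in range(len(s) - length + 1)]
```

Under these assumptions, document_tuple yields one tuple regardless of document length, whereas substring_hashes yields one pair for every sub-string position in the document.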

Similarly, Claims 1, 29, 30, and 33 include generating "a single tuple" for each
document. The claimed tuple comprises the document identifier for the document and the
hash value for the document (i.e., a single tuple per document). In contrast, Aiken, noting, for 
example, Col. 4, line 65 to Col. 5, line 4, forms pairs consisting of a hash value for each sub-
string and a position of the sub-string within the document. That is, Aiken's pairings are
formed on a sub-string basis, not on a document basis, and thus produce multiple tuples for each 
document. Therefore, Aiken does not disclose generating "a single tuple for the filtered 
document." 

Furthermore, Claims 1, 29, 30, 33, 34, and 37 include comparing a tuple generated for a 
document with a plurality of tuples representing other documents. In contrast, the <hash 
value, position> pairs formed in Aiken are compared with other <hash value, position> pairs 
similarly formed, and thus corresponding to sub-strings, not documents. Therefore, Aiken 
does not disclose comparing a tuple with a plurality of other tuples, each representing one of a 
plurality of documents. 

It is respectfully submitted that, for at least these reasons, Claims 1, 29, 30, 33, 34, 37, 38, 
and 42 are allowable over the cited prior art. Hence, it is further submitted that Claims 2-28 and 
43, which depend from Claim 1, Claims 31 and 32, which depend from Claim 30, Claims 35 and
36, which depend from Claim 34, and Claims 39-41, which depend from Claim 38, are also
allowable over the cited prior art. 

A number of additional arguments are also applicable to Claims 30 and 33. 

First, the invention claimed in Claims 30 and 33 includes retaining only retained tokens 
using at least one token threshold. The Office Action, noting the discussion of Claim 3 at Page 
4, cites Aiken at Col. 11, lines 15-30 as disclosing this limitation. However, Col. 11, lines 15-30 
address the retention of documents by comparing a match ratio for a document with a 
(document) threshold. While the match ratio is determined by comparing the hash values (of 
the sub-strings) of two documents and determining the ratio of the number of hash values in 
common to the total number of hash values in one of the documents (see, e.g., Aiken at Col. 11, 
lines 1-9), the threshold comparison is not performed on a token basis, but rather on a 
document basis. 
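
The difference between a document-basis threshold and a token-basis threshold may be illustrated by the following sketch. It is offered only as an illustration; the function names, the per-token scores, and the direction of each comparison (greater than or equal to the threshold) are assumptions and are not drawn from Aiken or from the claims.

```python
def match_ratio(hashes_a, hashes_b):
    # Ratio of hash values in common to the total number of hash values
    # in one of the documents (here, document A).
    unique_a = set(hashes_a)
    if not unique_a:
        return 0.0
    return len(unique_a & set(hashes_b)) / len(unique_a)


def retain_documents(current_hashes, candidates, doc_threshold):
    # Document-basis threshold: whole documents are kept or discarded.
    return [name for name, hashes in candidates.items()
            if match_ratio(current_hashes, hashes) >= doc_threshold]


def retain_tokens(tokens, scores, token_threshold):
    # Token-basis threshold: individual tokens are kept or discarded.
    # The scores mapping is hypothetical.
    return [t for t in tokens if scores.get(t, 0.0) >= token_threshold]
```

In the first function the unit being retained is a document; in the second it is a token within a single document's token stream.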

Second, the invention claimed in Claims 30 and 33 includes arranging retained tokens 
into an arranged token stream. Also, in Claims 30 and 33, subsequent processing, beginning 
with obtaining a hash value for a document, is performed on the tokens of the arranged token 
stream. The Office Action, noting the discussion of Claim 4 at Page 4, relies on Fig. 4a, step 
404 of Aiken for disclosure of such arranging; however, it is respectfully submitted that this
cannot correspond to the claimed arrangement of tokens. First, noting Col. 10, lines 49-54, step 404
sorts pairs, which, as discussed above, comprise a hash value and a position, based on the 
position such that position values from a previously-indexed document are grouped together
sequentially. The pairs are not the tokens, as claimed. Second, step 404 cannot correspond to
this arranging of tokens because the tokens in Claims 30 and 33 lack position values on which 
step 404 is based. 

Finally, the invention as claimed in Claims 30 and 33 includes inserting a tuple for a 
document into a document storage tree and determining that the document is similar to another 
document if the tuple, after being inserted, is collocated with a tuple corresponding to another 
document. The Office Action, noting the discussion of Claim 26 at Pages 5-6, relies on Fig. 4c 
and Col. 8, lines 31-54 of Aiken for disclosure of these features. Applicants respectfully submit, 
however, that these portions of Aiken do not disclose the claimed features. Col. 8, lines 31-54 of 
Aiken addresses a data structure that stores names of documents, ranges of bytes in the 
corresponding documents, and, optionally, total numbers of hashes in the respective documents. 
As explained at Col. 8, lines 48-54, the <hash value, position> pairs of Aiken are not inserted 
into this data structure, but rather, this data structure is used to retrieve data for a pair based on 
the position. Col. 8, lines 31-54 does not address anything resembling the determination of 
similarity. 

Fig. 4c and its accompanying explanation at Col. 11, line 47 to Col. 12, line 2 of Aiken 
address clustering of a current document with existing clusters of documents using a union-find 
algorithm. However, this algorithm does not address collocated tuples, as claimed, in order to 
determine similarity of documents. Rather, the union-find algorithm of Aiken determines a set to 
which a document belongs and associates it with (merges it into) that set.

For at least these further reasons, it is respectfully submitted that Claims 30-33 are
allowable over the cited prior art. 

The following further argument is applicable to Claims 34-37. These claims recite 
determining a single hash value for a document, accessing a document storage structure 
comprising hash values representing a plurality of documents, and detecting if a document is 
similar to another document in the storage structure based on determining if their hash values are 
equivalent. The Office Action relies on Aiken, noting Fig. 4c and Col. 11, line 47 to Col. 12, line 
2 for a disclosure of determining equivalent hash values. However, nowhere in the figure or in 
the cited text is there any mention of determining equivalence of hash values as a determination 
of similarity. 
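
For illustration only, detection of similarity by equivalence of hash values might be sketched as follows. The class name, its methods, and the use of a dictionary keyed by hash value are assumptions made for this example; the claims recite a document storage structure without limiting it to this form.

```python
from collections import defaultdict


class DocumentStore:
    # Hypothetical storage structure holding one hash value per stored document.

    def __init__(self):
        self._by_hash = defaultdict(list)

    def insert(self, doc_id, hash_value):
        # Return the identifiers of previously stored documents whose hash value
        # is equivalent to that of the new document, then store the new document.
        similar = list(self._by_hash[hash_value])
        self._by_hash[hash_value].append(doc_id)
        return similar
```

In this sketch a document is reported as similar to a stored document exactly when their single hash values are equal.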

For at least this further reason, it is respectfully submitted that Claims 34-37 are 
allowable over the cited prior art. 

The following further arguments are applicable to Claims 38-42. These claims recite 
comparing a document to documents in a document collection using a hash algorithm and 
collection statistics to detect if the document is similar to any of the documents in the document 
collection. The Office Action, in particular, cites Aiken, Fig. 4c and Col. 11, line 47 to Col. 12, 
line 2, as disclosing the use of collection statistics in clustering similar documents. However, 
there is no mention of any collection statistics in the cited figure or passage from Aiken. 
What is discussed is the use of a union-find algorithm. As discussed above, a union-find 
algorithm determines a set to which a document belongs and associates it with (i.e., merges it
into) that set. However, there is no disclosure in Aiken of a union-find algorithm using
collection statistics. 
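
The following is a generic, textbook-style sketch of a union-find structure, provided only to illustrate the point; it is not Aiken's implementation, and the class and method names are assumptions. It shows that such an algorithm operates solely on set membership and takes no collection statistics as input.

```python
class UnionFind:
    # Generic union-find (disjoint-set) structure over document identifiers.

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Determine the set (represented by its root) to which x belongs.
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        # Merge the set containing b into the set containing a.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a
        return root_a
```

The only inputs to find and union are document identifiers; nothing in the structure depends on statistics computed over the collection as a whole.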

Also, noting Col. 11, lines 56-60, the comparison performed in the clustering of Aiken is 
a comparison between the current document and a single document (from an existing cluster), 
meaning that collection statistics would not be used, as in the claimed invention. 

It is further noted that, at Col. 2, lines 36-41 and 47 (the latter line establishing a 
connection between Aiken's disclosed invention and the desirable properties described in the 
former lines), Aiken specifically teaches away from the use of "probability for measuring 
comparison accuracy." As statistics are probabilistic in nature, this would teach away from the 
use of collection statistics. 

For at least these reasons, it is respectfully submitted that Claims 38 and 42 are allowable 
over the cited prior art. It is further submitted that Claims 39-41, which depend from Claim 38, 
are thus also allowable over the cited prior art. 

In view of the above, it is respectfully submitted that all of Claims 1-43 are allowable 
over the cited prior art in view of the allowability of all of the independent claims. Applicants
additionally reiterate the following arguments regarding various dependent claims:

• Claim 3 includes "retaining a token in the token stream as a retained token according 
to at least one token threshold." Claim 3 is thus further allowable over the cited prior 
art in view of the argument discussed above in connection with the similar
limitation of Claim 30, in addition to the arguments made in connection
with Claim 1.

• Claim 4 includes "arranging the retained tokens in the token stream to obtain an
arranged token stream." Claim 4 is thus further allowable over the cited prior art in
view of the argument discussed above in connection with the similar limitation of
Claim 30, in addition to the arguments made in connection with
Claims 1 and 3.

• Claim 5 includes "determining a hash value [for a filtered document] by individually
processing each retained token in the token stream." In addition to the fact that Aiken
does not determine a hash value for a document (but, instead, determines hash values
for sub-strings), Aiken, in the cited passages (Col. 6, lines 7-28 and Col. 9, lines 24-26),
applies hashing to all sub-strings, not only to retained tokens. Claim 5 is thus
allowable over the cited prior art for this further reason, in addition to the arguments 
made in connection with Claims 1 and 3. 

• Claim 6 recites a step of "determining a score for each token in the token stream" and
a step of "comparing the score for each token to a first token threshold." Claim 7 
depends from Claim 6 and recites a further step of "comparing the score for each 
retained token [i.e., in a further step of Claim 6 noted below] to a second token 
threshold." These claims thus add limitations similar to the limitation added by
Claim 3, and the above arguments with respect to Claim 3 (and Claim 30), therefore,
also apply to these claims. Furthermore, the Office Action cites Col. 11, lines 15-30
of Aiken as disclosing the limitations of Claim 6, including the step of "modifying the
token stream"; however, as the cited passage is concerned with comparing and
discarding documents, not tokens, it does not disclose this limitation. Therefore,
Claims 6 and 7 are allowable over the cited prior art for these further reasons, in
addition to the arguments made in connection with Claim 1.

• Claim 10, as amended, now recites the step of "removing a token from the token
stream based on collection statistics and at least one token threshold." As previously 
discussed, neither collection statistics (discussed in connection with Claim 38) nor 
token thresholds (discussed in connection with Claim 30) are used in Aiken. Hence, 
Claim 10 is allowable over the cited prior art for these further reasons, in addition to 
the arguments made in connection with Claim 1.

• Claims 13 and 14 are directed to the use of collection statistics for filtering
documents. The arguments presented above in connection with a similar limitation in 
Claim 38 are thus applicable to these claims, providing a further basis (in addition to 
the arguments made in connection with Claim 1) for the allowability of Claims 13 and 
14 over the cited prior art. 

• Claim 14 recites that "the collection statistics [used in filtering the document] pertain
to the plurality of documents." In contrast, noting Col. 11, lines 1-14, Aiken uses
pairwise comparisons between two documents, not collection statistics based on a
plurality of documents. Claim 14 is, therefore, allowable over the cited prior art for
this further reason, in addition to the reasons discussed above. 

• Claim 39 recites that "the collection statistics [used in the comparing step of Claim 
38] pertain to the document collection [discussed in Claim 38]." In particular, Claim 
38, from which Claim 39 depends, recites "a plurality of documents in a document 
collection." Therefore, the argument presented in connection with Claim 14, in the 
immediately preceding paragraph, also provides a further basis for the allowability of 
Claim 39 over the cited prior art, in addition to the arguments made in connection 
with Claim 38. 

• Claims 25 and 26 are directed to storage structures and to the determination of 
similarity of documents whose tuples are collocated in the storage structures. The 
arguments presented in connection with a similar limitation in Claim 30 are, 
therefore, also applicable to these claims, thus providing a further basis for 
allowability of Claims 25 and 26 over the cited prior art, in addition to the arguments 
made in connection with Claim 1.

• Claim 43 recites the use of semantic filtering. Aiken is directed to the use of syntactic 
filtering. Hence, this provides a further basis for the allowability of Claim 43 over the 
cited prior art, in addition to the arguments made in connection with Claim 1.

Conclusion

All of the stated grounds of objection and rejection have been properly traversed, 
accommodated, or rendered moot. Applicants, therefore, respectfully request that the Examiner 
reconsider all presently outstanding objections and rejections and that they be withdrawn. 
Applicants believe that a full and complete reply has been made to the outstanding Office Action 
and, as such, the present application is in condition for allowance. If the Examiner believes, for 
any reason, that personal communication will expedite prosecution of this application, the 
Examiner is hereby invited to telephone the undersigned at the number provided. 

Prompt and favorable consideration of this Amendment is respectfully requested. 

Respectfully submitted, 



Date: June 5, 2003